Informed Initial Policies for Learning in Dec-POMDPs

Authors

  • Landon Kraemer
  • Bikramjit Banerjee
Abstract

Decentralized partially observable Markov decision processes (Dec-POMDPs) offer a formal model for planning in cooperative multi-agent systems where agents operate with noisy sensors and actuators and local information. While many techniques have been developed for solving Dec-POMDPs exactly and approximately, they have been primarily centralized and reliant on knowledge of the model parameters. In real-world scenarios, the model may not be known a priori, and a centralized computation method may not be feasible or desirable. Simple distributed reinforcement learning, particularly Q-learning (Sutton and Barto 1998), can address both of the above limitations. Q-learning agents can learn mappings from their own action-observation histories (H_i) to their own actions (A_i), via a quality function (Q_i : H_i × A_i → ℝ) that evaluates the long-term effects of selecting an action after observing a history. We investigate two approaches to applying Q-learning to Dec-POMDPs: one in which agents learn concurrently (Q-Conc), and one in which agents take turns to learn the best responses to each other's policies (Q-Alt). In both Q-Conc and Q-Alt, an agent that is learning follows a dynamic policy (i.e., a policy that changes as the agent learns); however, in Q-Alt, each agent that is not currently learning follows a static policy. Thus, in Q-Alt an agent that has not yet learned a policy needs an initial policy to follow. There are simple ways to choose an arbitrary initial policy in constant time, e.g., choosing a pure policy at random, but such methods do not consider the transition, observation, and reward structure of the Dec-POMDP. As the main contribution of this paper, we propose a simple and principled approach to building an initial joint policy that is based upon the transition and reward (but not the observation) functions of a Dec-POMDP. To lay the groundwork, we first discuss how to compute this initial joint policy in a centralized, model-based fashion. We then discuss how agents can learn such policies in a distributed, model-free manner, and then demonstrate for two benchmark problems that Q-Alt initialized with this informed policy (QIP-Alt) produces better joint policies than Q-Conc and Q-Alt initialized with arbitrary policies.
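The history-based Q-learning setup described in the abstract can be sketched compactly. The snippet below is a minimal, illustrative sketch only: the two-agent environment `env` (with `reset()` and `step()` returning per-agent observations and a shared reward), the hyperparameters, and the episode structure are assumptions for illustration, not the authors' implementation. It shows the Q-Alt pattern, in which only the currently learning agent updates its table Q_i : H_i × A_i → ℝ while the other agent follows its static policy; updating every agent's table on each step would instead give the concurrent variant (Q-Conc).

```python
import random
from collections import defaultdict

NUM_AGENTS = 2
NUM_ACTIONS = 3                     # |A_i|, assumed identical for both agents
ALPHA, GAMMA, EPS = 0.1, 0.95, 0.1  # learning rate, discount, exploration

# One table per agent: Q_i maps an action-observation history (a tuple of
# (action, observation) pairs) to a vector of action values.
Q = [defaultdict(lambda: [0.0] * NUM_ACTIONS) for _ in range(NUM_AGENTS)]

def greedy(values):
    return max(range(NUM_ACTIONS), key=lambda a: values[a])

def select_action(agent, history, learning):
    """Epsilon-greedy for the learning agent; greedy (static policy) otherwise."""
    if learning and random.random() < EPS:
        return random.randrange(NUM_ACTIONS)
    return greedy(Q[agent][history])

def run_episode(env, learner, horizon=10):
    """One Q-Alt episode: only `learner` updates its table; the other agent
    follows its current static policy. Updating all agents' tables instead
    would give the concurrent variant (Q-Conc)."""
    env.reset()                                # histories start empty
    histories = [() for _ in range(NUM_AGENTS)]
    for _ in range(horizon):
        actions = [select_action(i, histories[i], i == learner)
                   for i in range(NUM_AGENTS)]
        observations, reward = env.step(actions)   # shared (joint) reward
        for i in range(NUM_AGENTS):
            next_hist = histories[i] + ((actions[i], observations[i]),)
            if i == learner:                   # standard Q-learning update
                target = reward + GAMMA * max(Q[i][next_hist])
                Q[i][histories[i]][actions[i]] += ALPHA * (
                    target - Q[i][histories[i]][actions[i]])
            histories[i] = next_hist
```

In this sketch, the "initial policy" an agent follows before its first learning turn is whatever its (initially zero) Q-table induces; the paper's contribution is to replace that arbitrary starting point with a policy informed by the transition and reward structure.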


Related articles

Efficient Planning for Factored Infinite-Horizon DEC-POMDPs

Decentralized partially observable Markov decision processes (DEC-POMDPs) are used to plan policies for multiple agents that must maximize a joint reward function but do not communicate with each other. The agents act under uncertainty about each other and the environment. This planning task arises in optimization of wireless networks, and other scenarios where communication between agents is r...


Learning for Decentralized Control of Multiagent Systems in Large, Partially-Observable Stochastic Environments

Decentralized partially observable Markov decision processes (Dec-POMDPs) provide a general framework for multiagent sequential decision-making under uncertainty. Although Dec-POMDPs are typically intractable to solve for real-world problems, recent research on macro-actions (i.e., temporally-extended actions) has significantly increased the size of problems that can be solved. However, current...


The Cross-Entropy Method for Policy Search in Decentralized POMDPs

Decentralized POMDPs (Dec-POMDPs) are becoming increasingly popular as models for multiagent planning under uncertainty, but solving a Dec-POMDP exactly is known to be an intractable combinatorial optimization problem. In this paper we apply the Cross-Entropy (CE) method, a recently introduced method for combinatorial optimization, to Dec-POMDPs, resulting in a randomized (sampling-based) algor...
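The cross-entropy idea mentioned above (sample candidate policies, keep the best, refit the sampling distribution) can be sketched as follows. This is an illustrative sketch of generic CE policy search, not the cited paper's algorithm; the encoding of a joint policy as an integer vector and the hypothetical `evaluate` function are assumptions.

```python
import numpy as np

def cross_entropy_search(num_params, num_actions, evaluate,
                         pop_size=100, elite_frac=0.1, iters=50):
    """Generic cross-entropy search over discrete policy parameters.

    num_params: number of decision points (e.g., policy-tree nodes) to fill in.
    evaluate:   assumed callable mapping a candidate policy (int vector) to its
                estimated joint value, e.g., via simulation.
    """
    # Start from a uniform categorical distribution over actions per parameter.
    probs = np.full((num_params, num_actions), 1.0 / num_actions)
    n_elite = max(1, int(elite_frac * pop_size))
    for _ in range(iters):
        # Sample candidate joint policies from the current distribution.
        samples = np.array([
            [np.random.choice(num_actions, p=probs[j]) for j in range(num_params)]
            for _ in range(pop_size)])
        scores = np.array([evaluate(s) for s in samples])
        elite = samples[np.argsort(scores)[-n_elite:]]   # best candidates
        # Refit the categorical distribution to the elite set, with a small
        # smoothing term so no action's probability collapses to zero.
        for j in range(num_params):
            counts = np.bincount(elite[:, j], minlength=num_actions) + 1e-3
            probs[j] = counts / counts.sum()
    return probs
```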


Learning for Multiagent Decentralized Control in Large Partially Observable Stochastic Environments

This paper presents a probabilistic framework for learning decentralized control policies for cooperative multiagent systems operating in a large partially observable stochastic environment based on batch data (trajectories). In decentralized domains, because of communication limitations, the agents cannot share their entire belief states, so execution must proceed based on local information. D...


Influence-Based Policy Abstraction for Weakly-Coupled Dec-POMDPs

Decentralized POMDPs are powerful theoretical models for coordinating agents’ decisions in uncertain environments, but the generally-intractable complexity of optimal joint policy construction presents a significant obstacle in applying Dec-POMDPs to problems where many agents face many policy choices. Here, we argue that when most agent choices are independent of other agents’ choices, much of...



Journal:

Volume   Issue

Pages  -

Publication date: 2012